Demystifying Data

05 - ML and Models and Data - Oh My!!!

2024-01-11

What is Machine Learning?

A Word of Caution

Dan Ariely

Big Data is like teenage sex:

everyone talks about it

nobody really knows how to do it

everyone thinks everyone else is doing it

so everyone claims they are doing it.

Terminology

Different meanings

Machine Learning


Artificial Intelligence


Statistical Learning


Applied Statistics

Historical context important


ML primarily from CS / EE

‘Engineering’ mentality

Data Format


Tabular based


   default student   balance    income
1       No      No  729.5265 44361.625
2       No     Yes  817.1804 12106.135
3       No      No 1073.5492 31767.139
4       No      No  529.2506 35704.494
5       No      No  785.6559 38463.496
6       No     Yes  919.5885  7491.559
7       No      No  825.5133 24905.227
8       No     Yes  808.6675 17600.451
9       No      No 1161.0579 37468.529
10      No      No    0.0000 29275.268
11      No     Yes    0.0000 21871.073
12      No     Yes 1220.5838 13268.562
13      No      No  237.0451 28251.695
14      No      No  606.7423 44994.556
15      No      No 1112.9684 23810.174
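
As a minimal sketch of the tabular format, the first five rows above can be held as a list of records: one row per observation, one column per variable. The Python below uses only the standard library; the helper name `load_rows` is an assumption made for illustration.

```python
import csv
import io

# First five rows of the table above, re-typed as CSV for illustration.
RAW = """default,student,balance,income
No,No,729.5265,44361.625
No,Yes,817.1804,12106.135
No,No,1073.5492,31767.139
No,No,529.2506,35704.494
No,No,785.6559,38463.496
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns to float."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rec["balance"] = float(rec["balance"])
        rec["income"] = float(rec["income"])
        rows.append(rec)
    return rows

rows = load_rows(RAW)
n_students = sum(1 for r in rows if r["student"] == "Yes")
mean_balance = sum(r["balance"] for r in rows) / len(rows)
print(len(rows), n_students, round(mean_balance, 2))
```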

Exchangeability

Predictive Focus


Predictive accuracy


De-emphasises inference / uncertainty / explainability

Discoverability of model parameters

Example of linear models

Production


Scaling issues


Automated ML pipelines


Software engineering

Model Validation

Overfitting

Bias-Variance Tradeoff


\[ \begin{aligned} \text{Bias} &= \text{under-complexity error} \\ \text{Variance} &= \text{over-complexity error} \end{aligned} \]
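
For squared-error loss, the standard decomposition makes the tradeoff explicit. The notation is introduced here and is not from the slides: \(y = f(x) + \varepsilon\) with \(\mathbb{E}[\varepsilon] = 0\) and \(\text{Var}(\varepsilon) = \sigma^2\), and \(\hat{f}\) is the model fitted on a random training set.

\[ \mathbb{E}\left[ (y - \hat{f}(x))^2 \right] = \underbrace{\left( \mathbb{E}[\hat{f}(x)] - f(x) \right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]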

Cross-validation


Training-test split


\(k\)-fold


Train-validation-test split
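
A minimal sketch of how \(k\)-fold splitting might be done by hand, assuming rows are addressed by index. The function name `k_fold_indices` and the choice of 15 rows and 5 folds are illustrative assumptions.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)       # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n_rows=15, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each row appears in exactly one test fold, so every observation is used for validation once and for training \(k - 1\) times.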

Supervised Learning

Labelled data

\[ \begin{aligned} \text{Discrete output} &\rightarrow \text{Classification} \\ \text{Continuous output} &\rightarrow \text{Regression} \end{aligned} \]

Linear Models


Assumes the data follow a specified distributional form


Linear in parameters
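
A minimal sketch of a one-feature linear model, \(y = a + b x\), fitted by ordinary least squares. The data are made up to lie exactly on \(y = 1 + 2x\), and `fit_line` is a hypothetical helper; the point is that the fitted parameters can be read off directly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```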

Decision / Regression Trees

Simple to understand


Highly explainable


Prone to overfitting
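
A minimal sketch of the core step in growing a tree: choosing one split that minimises Gini impurity. It uses the `income` and `student` columns of the 15-row table shown earlier; predicting `student` from `income` is an illustrative choice, and `gini` and `best_split` are hypothetical helper names.

```python
# (income, student) pairs taken from the 15-row table shown earlier.
DATA = [
    (44361.625, "No"), (12106.135, "Yes"), (31767.139, "No"),
    (35704.494, "No"), (38463.496, "No"), (7491.559, "Yes"),
    (24905.227, "No"), (17600.451, "Yes"), (37468.529, "No"),
    (29275.268, "No"), (21871.073, "Yes"), (13268.562, "Yes"),
    (28251.695, "No"), (44994.556, "No"), (23810.174, "No"),
]

def gini(labels):
    """Gini impurity of a list of class labels (0 means pure)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(data):
    """Find the threshold on the single feature minimising weighted Gini impurity."""
    values = sorted({x for x, _ in data})
    best_threshold, best_score = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split halfway between neighbours
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

threshold, impurity = best_split(DATA)
print(threshold, impurity)
```

A full tree repeats this search inside each resulting region, which is why an unconstrained tree can keep splitting until it fits the training data exactly.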

Random Forest


Ensemble of trees


Aggregate low-bias trees to reduce variance

Sample of rows, constrain splits


Mostly self-tuning: defaults usually work well
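
A minimal sketch of the row-sampling and aggregation idea, using bagged one-split trees (stumps) as a simplified stand-in for a random forest. Real forests grow deep trees and also restrict the features considered at each split; that step is not shown because this toy data has a single feature. The data are the `income` and `student` columns from the table shown earlier, and all function names are assumptions.

```python
import random
from collections import Counter

# (income, student) pairs taken from the 15-row table shown earlier.
DATA = [
    (44361.625, "No"), (12106.135, "Yes"), (31767.139, "No"),
    (35704.494, "No"), (38463.496, "No"), (7491.559, "Yes"),
    (24905.227, "No"), (17600.451, "Yes"), (37468.529, "No"),
    (29275.268, "No"), (21871.073, "Yes"), (13268.562, "Yes"),
    (28251.695, "No"), (44994.556, "No"), (23810.174, "No"),
]

def fit_stump(sample):
    """Return the threshold t minimising errors of the rule: student if income <= t."""
    candidates = [float("-inf")] + sorted({x for x, _ in sample})

    def errors(t):
        return sum((x <= t) != (y == "Yes") for x, y in sample)

    return min(candidates, key=errors)

def fit_forest(data, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]
        stumps.append(fit_stump(sample))
    return stumps

def predict(stumps, income):
    """Majority vote across all stumps."""
    votes = Counter("Yes" if income <= t else "No" for t in stumps)
    return votes.most_common(1)[0][0]

forest = fit_forest(DATA)
print(predict(forest, 10000.0), predict(forest, 40000.0))
```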

Boosting


Ensemble of trees


Aggregate low-variance trees to reduce bias

Often the most accurate approach on tabular data


Tuning more involved

Support Vector Machines (SVM)


Geometric method


Divides ‘feature space’ into regions

Neural Networks

Unsupervised Learning

Unlabelled data

Clustering
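
A minimal sketch of clustering with k-means (Lloyd's algorithm) in one dimension. The values and starting centres are made up for illustration, and `k_means_1d` is a hypothetical helper; no labels are used at any point.

```python
def k_means_1d(values, centres, n_iter=10):
    """Lloyd's algorithm in one dimension, starting from the given centres."""
    centres = list(centres)
    for _ in range(n_iter):
        # Assignment step: attach each value to its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to its cluster mean (unchanged if empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# Made-up values forming two obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centres, clusters = k_means_1d(values, centres=[0.0, 6.0])
print(centres)
```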

Real-world Example

Topic Modelling


Dimensionality Reduction

Natural Language Processing

More prevalent recently


Supervised / Unsupervised / Semi-supervised


Google Translate

Uses neural networks


Very large models, usually accessed via APIs

Entity Extraction


IRISHMEN AND IRISHWOMEN: In the name of God and of the dead generations from which she receives her old tradition of nationhood, Ireland, through us, summons her children to her flag and strikes for her freedom.

doc_id token       lemma      upos  relation
1      IRISHMEN    Irishmen   NOUN  root
1      AND         and        CCONJ cc
1      IRISHWOMEN  IRISHWOMEN NOUN  conj
1      :           :          PUNCT punct
1      In          in         ADP   case
1      the         the        DET   det
1      name        name       NOUN  obl
1      of          of         ADP   case
1      God         God        PROPN nmod
1      and         and        CCONJ cc
1      of          of         ADP   case
1      the         the        DET   det
1      dead        dead       ADJ   amod
1      generations generation NOUN  conj
1      from        from       ADP   case

Latent Dirichlet Allocation (LDA)


Unsupervised (clustering)


Topic modelling


Lots of functionality

Assign “topics” to each “document”

Reference: http://dontloo.github.io/blog/lda/

word2vec


Words as vectors


Semantic meaning

\[ \text{King} - \text{Male} + \text{Female} \approx \text{Queen} \]


\[ \text{Paris} - \text{France} + \text{UK} \approx \text{London} \]
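
A minimal sketch of the vector arithmetic above, using cosine similarity to find the nearest word. The three-dimensional vectors are hand-made for illustration only; real word2vec embeddings are learned from text and typically have hundreds of dimensions. The names `VECTORS`, `cosine` and `analogy` are assumptions.

```python
import math

# Hand-made toy vectors (not learned). Axes: royalty, maleness, femaleness.
VECTORS = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "male":   [0.0, 1.0, 0.0],
    "female": [0.0, 0.0, 1.0],
    "child":  [0.2, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "male", "female"))
```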

Summary

Thank You


mcooney@describedata.com


https://kaybenleroll.github.io/data_workshops/talk_cirdas_master_202311/